Compiling Techniques Notes

Table of Contents

High level view of a compiler

Takes source code and produces machine code, often creates errors instead.

  • Must recognise legal (and illegal) code
  • Must generate correct code
  • Must produce understandable errors

Two pass compiler

Uses an intermediate representation (IR). Maps front end to IR and IR to machine code. Admits multiple front ends and multiple passes. Front is \(O(n)\) or \(O(n\log n)\) while back-end is NP-complete.

Common fallacy is front ends and back-ends can be swapped for every system. This is the Holy Grail of compiler research, and is currently still theory.

Front end chain

Lexer

Consists of a Scanner and Tokeniser. Scanner reads in text and tokeniser recognises words from.

Example: \(x=y+2\) becomes IDENTIFIER(x) EQUAL IDENTIFIER(y) PLUS NUMBER(2)

Parser

Recognises context free syntax and report errors. Easy to build hand coded parsers.

Semantic Analyser

Guides context sensitive (semantic) analysis

IR generator

Generates the IR for the rest of the compiler

Abstract Syntax Tree (AST)

Summarises grammatical structure

Back end

Translate IR to target machine code. Choose instructions for IR chain. Ensure conformance with system architecture.

Instruction selection

Pattern matching. Produce fast, compact code.

Register allocation

Limited registers, need to manage a limited set of resources. Can change instruction choices and insert LOAD and STORE. Optimal is NP-Complete as is usually a graph colouring problem.

Instruction scheduling

Avoid hardware stalls and interlocks. Use all functional units productively. Optimal is NP-Complete.

Middle end

Code improvement and optimisation. Analyse and improve IR. Aim to reduce the running time (or space or power) of compiled code, while preserving meaning.

Optimiser

Multiple passes through the optimiser to try and improve performance.

Restructurer

Translate high level IR to low level IR. Can be useful for "high level" optimisation.

The Lexer

Maps character stream into words - basic syntax units. Typical tokens: number, identifier, +, -, new, while, for. Scanner removes white-space and comments.

Context-free language

Specified in a grammar, e.g. SheepNoise \(\to\) SheepNoise baa | baa.

Regular Expressions

Represented using an augmented BNF notation Expression for all integers: ['-'] (('1'|…|'9') ('0' | … | '9')+ | '0')